Tentative implementation for testing prompts in chatbot (WIP) #4253
robinpaul85 wants to merge 13 commits into llm.chatbot.prototype.2 from
Conversation
Pull request overview
This PR adds a WIP pathway for automatically running chatbot “test prompts” against datasets, wiring a new testchat CLI flag into server startup and refactoring the existing chatbot unit-test script into a callable function.
Changes:
- Add `testchat` argv handling in `server/src/app.ts` to iterate datasets and invoke chatbot tests.
- Refactor `server/routes/chat/test/chatUnitTests.ts` to export `test_chatbot_by_dataset()` and add matrix-output validation.
- Enable chatbot test data for the TermdbTest dataset and adjust server package/dependency versions.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| server/src/app.ts | Imports and triggers chatbot tests via a new testchat CLI argument during launch. |
| server/routes/chat/test/chatUnitTests.ts | Exposes a dataset-scoped test runner and adds validation for additional chatbot actions. |
| server/package.json | Changes server workspace version and several internal dependency versions. |
| server/dataset/termdb.test.ts | Adds queries.chat.aifiles to point to chatbot test data JSON. |
server/src/app.ts
```ts
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			test_chatbot_by_dataset(ds)
		}
	}
}
```
The testchat branch calls test_chatbot_by_dataset(ds) without await, and handle_argv() doesn’t return an exit object afterward. This means the server can continue to start listening while tests run in the background, and async failures can become unhandled rejections. Await the test runs, aggregate pass/fail, and return an appropriate { code } (or exit) so testchat behaves like a real CLI test command.
Suggested change:

```ts
let hadFailure = false
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			try {
				const result: any = await test_chatbot_by_dataset(ds as any)
				// If the test function returns a result object indicating failure, mark it.
				if (
					result === false ||
					(typeof result === 'object' &&
						result !== null &&
						(('success' in result && !result.success) || ('ok' in result && !result.ok)))
				) {
					hadFailure = true
				}
			} catch (e) {
				hadFailure = true
			}
		}
	}
}
return { code: hadFailure ? 1 : 0 }
```
```ts
const testing = true // This causes raw LLM output to be sent by the agent
const llm = serverconfig.llm
if (!llm) throw 'serverconfig.llm is not configured'
if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
	throw "llm.provider must be 'SJ' or 'ollama'"
}
```
This module runs serverconfig.llm validation at import time and throws if missing/unsupported. Once this file is imported by the server (e.g. via src/app.ts), that becomes a server-startup side effect even when not running chatbot tests. Move these checks inside test_chatbot_by_dataset() (or the dedicated CLI entrypoint) so importing the module can’t crash the server.
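A minimal sketch of this suggestion, assuming the same `serverconfig.llm` shape as the diff above; `LlmConfig` and `getValidatedLlm` are hypothetical names used for illustration only:

```typescript
// Hypothetical sketch: run the llm config checks lazily, from inside the
// test entrypoint, so merely importing the module cannot crash the server.
interface LlmConfig {
	provider: string
}

function getValidatedLlm(serverconfig: { llm?: LlmConfig }): LlmConfig {
	const llm = serverconfig.llm
	if (!llm) throw new Error('serverconfig.llm is not configured')
	if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
		throw new Error("llm.provider must be 'SJ' or 'ollama'")
	}
	return llm
}
```

`test_chatbot_by_dataset()` would then call `getValidatedLlm(serverconfig)` as its first step, keeping the import side-effect free.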
```ts
export async function test_chatbot_by_dataset(ds: any) {
	// Check to see if the dataset supports the AI chatbot
	//console.log("dataset:", dataset.aifiles)
	if (!(ds as any)?.queries?.chat.aifiles) throw 'AI dataset JSON file is missing for dataset:' + ds.label
	const aifiles = (ds as any)?.queries?.chat.aifiles
	const dataset_json = await readJSONFile(aifiles) // Read AI JSON data file
	//console.log("dataset_json:", dataset_json)
	const aiFilesDir = path.dirname(aifiles)
	for (const test_data of dataset_json.TestData) {
		const test_result = await run_chat_pipeline(
			test_data.question,
			llm,
			serverconfig.aiRoute,
			dataset_json,
			testing, // This is not needed anymore, need to be deprecated
			serverconfig.tpmasterdir + '/' + dataset_json.db,
			serverconfig.tpmasterdir + '/' + dataset_json.genedb,
			ds,
			aiFilesDir
		)
		console.log('test_result:', test_result)
		if (test_result.action == 'html') {
```
test_chatbot_by_dataset() currently only logs mismatches and continues; it never signals failure to the caller (no return value/throw), so an automated run can succeed even when prompts fail. Consider returning a boolean/result summary and throwing (or setting a non-zero exit code in the caller) when any test case fails.
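One way to sketch this, under the assumption that each prompt's outcome can be reduced to a pass/fail flag; `TestSummary` and `summarize` are hypothetical names, not part of the PR:

```typescript
// Hypothetical sketch: aggregate per-prompt outcomes into a summary the
// caller (e.g. the testchat argv handler) can turn into an exit code.
interface TestSummary {
	passed: number
	failed: number
	failures: string[] // questions whose results did not match
}

function summarize(results: { question: string; ok: boolean }[]): TestSummary {
	const summary: TestSummary = { passed: 0, failed: 0, failures: [] }
	for (const r of results) {
		if (r.ok) summary.passed++
		else {
			summary.failed++
			summary.failures.push(r.question)
		}
	}
	return summary
}
```

`test_chatbot_by_dataset()` could return such a summary (or throw when `failed > 0`), letting the CLI branch compute its exit code.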
```ts
console.log(
	'html resource request did not match for prompt' +
		test_data.question +
		'. LLM response :' +
		test_result.response +
		' Actual response: ' +
		test_data.answer
```
The mismatch log messages concatenate for prompt directly with the question text (missing a separating space/colon), which makes logs harder to read/search. Add a delimiter (e.g., "for prompt: ") consistently for html/summary/dge/matrix cases.
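A small sketch of one way to do this, using a template literal so the delimiter cannot be dropped; `mismatchMessage` is a hypothetical helper name:

```typescript
// Hypothetical helper: build the mismatch log line with an explicit
// "for prompt: " delimiter so logs are readable and grep-able.
function mismatchMessage(kind: string, question: string, got: string, expected: string): string {
	return `${kind} resource request did not match for prompt: ${question}. LLM response: ${got}. Expected: ${expected}`
}
```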
```diff
 	activeTracks: ['bw 1', 'bed 1']
 },
-chat: {}
+chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
```
The aifiles path is ./proteinpaint/server/dataset/ai/termdb.test.json, but this repo’s file lives at server/dataset/ai/termdb.test.json. Since readJSONFile() reads paths relative to process.cwd(), running from the server/ workspace will not find ./proteinpaint/... (there is no server/proteinpaint/ dir). Update this to a path that resolves correctly from the server workspace (or make it absolute via import.meta.url).
Suggested change:

```diff
-chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
+chat: { aifiles: './dataset/ai/termdb.test.json' }
```
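The `import.meta.url` alternative mentioned above could look like the following sketch; `resolveAiFile` is a hypothetical helper, and in the module itself the first argument would be `import.meta.url`:

```typescript
// Hypothetical sketch: resolve a data file relative to the defining module
// rather than process.cwd(), so the path works from any working directory.
import path from 'node:path'
import { fileURLToPath } from 'node:url'

function resolveAiFile(moduleUrl: string, relPath: string): string {
	const moduleDir = path.dirname(fileURLToPath(moduleUrl))
	return path.join(moduleDir, relPath)
}
```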
```diff
 "name": "@sjcrh/proteinpaint-server",
-"version": "2.177.1-0",
+"version": "2.177.0",
 "type": "module",
```
The server package version was changed from a 2.177.1-0 workspace-aligned version to 2.177.0, which makes it inconsistent with the repo/workspace versions (root/client/front/container packages remain 2.177.1-0). Unless this is an intentional coordinated version rollback across the monorepo, revert this to keep workspace versions in sync.
server/package.json
```diff
 "@sjcrh/proteinpaint-python": "2.174.0",
 "@sjcrh/proteinpaint-r": "2.152.1-0",
-"@sjcrh/proteinpaint-rust": "2.177.1-0",
-"@sjcrh/proteinpaint-shared": "2.177.1-0",
-"@sjcrh/proteinpaint-types": "2.177.1-0",
+"@sjcrh/proteinpaint-rust": "2.177.0",
+"@sjcrh/proteinpaint-shared": "2.177.0",
+"@sjcrh/proteinpaint-types": "2.177.0",
```
Several internal dependencies were downgraded (e.g. @sjcrh/proteinpaint-python to 2.174.0) while other workspaces in this repo are at 2.177.1-0. In a workspaces setup, this can cause npm to pull older registry packages instead of using the local workspace versions, and will likely desync package-lock.json. Re-align these dependency versions with the monorepo release version (or use a workspace protocol) if the downgrade wasn’t deliberate.
@robinpaul85 @compbiolover @xzhou82 In order to minimize having test/spec code in prod, and instead of importing test code in
A different way to do this is have code similar to
```ts
function validate_scatter_output(output: SampleScatterType, expected: SampleScatterType): boolean {
	if (output.plotName != expected.plotName) {
		console.log(
			'SampleScatter plotName did not match. LLM response: ' + output.plotName + ' Expected: ' + expected.plotName
```
Should use assertions, such as tape's test.equal(...). It would clean up the code, make the tests more familiar, and make them easier to migrate to an integration or e2e spec file.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…localhost:3000/testchat
```ts
		ds
	)
	console.log('test_result:', test_result)
	if (test_result.type == 'html') {
```
The detailed if-else chain is counter-intuitive. Just strictly require test_result to be deepEqual to the expected result object, and get rid of all these if-else branches.
Should also use the tape library to assert (e.g. tape's test.deepEqual()); the test logs would be easier to read and pass/fail counts would be tracked automatically.
Yes, I can use that. However, using deepEqual() on the entire object may raise false errors. For example, for prompts with invalid dictionary terms the LLM usually hallucinates in a non-reproducible manner (especially if we change the LLM itself). In such a scenario deepEqual() will raise a false error.
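One possible middle ground, sketched here with node's `deepStrictEqual` rather than tape: deep-compare only the fields that are deterministic across runs, so hallucinated free-text cannot cause false failures. `assertStableFields` is a hypothetical helper, not code from this PR:

```typescript
// Hypothetical sketch: deep-compare only a whitelisted subset of fields,
// leaving nondeterministic LLM output (e.g. free-text responses) unchecked.
import { deepStrictEqual } from 'node:assert'

function assertStableFields(
	actual: Record<string, unknown>,
	expected: Record<string, unknown>,
	fields: string[]
): void {
	for (const f of fields) {
		deepStrictEqual(actual[f], expected[f], `field "${f}" did not match`)
	}
}
```

A test case could then assert on e.g. `['action', 'plotName']` while ignoring the raw response text.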
```ts
you may temporarily uncomment below to generate runtime route checker code,
should only uncomment when a file has been added or deleted in
shared/types/src/routes and not when modified.
*/
```
It looks like the indentation is being removed. Make sure prettier is being used to reformat staged code, as triggered by git pre-commit or pre-push hooks.
Description
Implementation for automatically running test prompts (WIP).
Checklist
Check each task that has been performed or verified to be not applicable.